Canonical Huffman
Introduction
This article assumes knowledge of at least static huffman. It will teach how to gain a little bit of bpb. Also, it will lead to less memory requirements and faster decoding.
Canonical Huffman
Canonical huffman defines a set of rules and then based on the lengths of the codes and the symbols it makes the codes. So you don't need to save the codes along with the bitstream anymore. What are these rules by which can make the codes? They are the following:
- Codes with shorter code length have higher values than longer ones.
- Codes with the same length increase the code value as the symbol value increases.
Those rules lead to an efficient decoding and also coding because no more ceil(log2(alphabetsize)) bits may be 1. This formula means the following: ceil() rounds to an int, log2 means (Log base10 (x)) / (Log base10 (2)), and alphabetsize is the number of symbols which you'll code.
Pseudo code
The first thing we need are symbols and code lengths computed via huffman, then you can start to do the codes. The pseudo code to do that isn't that difficult. Look:
- First compute the first code for the maximum length, it is 'max lenght'
bits long and its value is all 0s.
- Set last_#_codes to the number of codes that the maximum length has.
- Repeat from Maximum length to 0.
- code = code + last_#_codes
- code >>= 1
- current length start_code = code
- last_#_codes = number of codes in current length
Sort all the lengths and symbols alphabetically. Then go through the codelength and assign to the current code the start code value of that length, further increment that value and repeat. (Note that by incrementing the code value, you meet the second rule.)
Want an example? Let's say we have a file which huffman codes and the code lengths are the following:
Symbol Length Code
a 2 11
b 2 00
c 3 011
d 3 101
e 4 0100
f 4 0101
i 4 1000
g 5 10010
h 5 10011
Note that we'll not need the codes, so we don't need to compute them, only the code lengths. Also note that in the following process we don't need to keep track of the actual code length. (I assume that you have somewhere else in memory an array with the symbols and their lengths, and also you've probably reserved space for the codes.)
Before starting this loop we have to set up an array which contains the number of codes which have a given length. I don't think you need to learn how this is done. For our example it will be:
#codes[16]={0,2,2,3,2,0,0,0,0,0,0,0}
We have 16 entries because we assume our codes to be 16 bits long. Let's start doing the start_codes:
- count = maximum code 5
- code = 0
- start_code[count] = code
- last_#_codes = #codes[count] #codes[5]=2
- --count
- Loop
- code + last_#_codes = 0010 0+2=2
- code >> 1 = 0001
- start_code[count] = code start_code[4]=0001
- last_#_codes = number of codes in current length last_#_codes=3
- Repeat
- code + last_#_codes = 100 1+3=4
- code >> 1 = 010
- start_code[count] = code start_code[3]=010
- last_#_codes = number of codes in current length last_#_codes=2
- Repeat
- code + last_#_codes = 100 2+2=4
- code >> 1 = 10
- start_code[count] = code start_code[2]=10
- last_#_codes = number of codes in current length last_#_codes=2
- Repeat
- code + last_#_codes = 100 2+2=4
- code >> 1 = 10
- start_code[count] = code start_code[1]=010
- last_#_codes = number of codes in current length last_#_codes=0
- Stop
Remember that we don't need to keep track of the length of the current code. In this case we don't care about what the start code for the length 1 is. We'll not use it. Now we have those start codes:
- start_code[5]=00000 - start_code[4]=0001 - start_code[3]=010 - start_code[2]=10 - start_code[1]=0
And those are the symbols and lengths:
Symbol Length
a 2
b 2
c 3
d 3
e 4
f 4
i 4
g 5
h 5
Then we sort the symbols and lengths, visit them in alphabetical order, and start to assign codes:
Symbol Length Start code Code Incremented
a 2 10 10 11
b 2 11 11 00
c 3 010 010 011
d 3 011 l011 100
e 4 0001 0001 0010
f 4 0010 0l010 0011
g 5 00000 00000 00001
h 5 00001 00001 l00010
i 4 0011 0011 0100
And those codes are instantaneous ones, so it is no problem to decode them. Also both the first rule and the second are met, the second explicity, and for the first, look:
Symbol Length Start code Code Value
a 2 10 10 16
b 2 11 11 24
c 3 010 010 8
d 3 011 l011 12
e 4 0001 0001 2
f 4 0010 0l010 4
g 5 00000 00000 0
h 5 00001 00001 1
i 4 0011 0011 6
Encoder memory usage
Remember the first rule, and what it ensured: no more than ceil(log2(alphabetsize)) bits may be 1. If we are encoding an alphabetsize with 256 symbols (probably literals) then, no more than 8 bits can be 1. So if our codes are 16 bits long then the msb will be 00000000XXXXXXXX where X means it can be 0 or 1. So when saving the codes in memory we just need to save the low part of it, because we know that the high part will be always 0. So when we have to write a code, we check if it's above 8. If it is, then we write the low part, and for the high just zeroes. We can even make make a special putbits routine which only shifts but doesn't OR (because we only write 0). Before that our data structures looked something like that:
- Byte, length of the code
- Word, code
Thus making it not fit in word boundaries, so we usually had to make it fit in dword boundaries with something like that:
- Byte, length of the code
- Byte, dummy
- Word, code
Accessing it was just a matter of a shift left 2. But now we know that we only need to store in memory the low part, so we can save it in that way:
- Byte, length
- Byte, low part of code
Thus making it fit in word boundaries and making it easy to access, also we could load it at once, and using less memory reduces the cache misses.
Saving the header
Well, this is the advantage of canonical huffman: We don't need to store the codes. Of course we still need to pass to the decoder the lengths and symbols. It can be done in the following way: First there's 16 bytes which mean the symbols that a given code length has. After that there's the symbols starting with those for the code length 1, then 2,... in alphabetical order. It's quite clear how to decode this, isn't it? Of course once you have the lengths and symbols you do again the codes and you can start to decode.
Closing words
The fact that long codes are filled with zeroes may lead to fast decoding. Look at http://www.compressconsult.com/huffman/.
If you find any error in this article or want something to improve email me.